Description
Inference
Prediction
In your groups: You want to know whether a new antacid reduces people’s heartburn.
Think about how you might go about using data to answer this question.
Statistical Inference is using information/data from a sample to draw conclusions about a population
\(\mathbf{X} = \left(X_1, X_2, \ldots, X_n \right)\) is a sample of data from a distribution \(P_{\theta}\). We want to use \(\mathbf{X}\) to learn about \(P_{\theta}\) since we can’t directly observe \(P_{\theta}\).
For example:
Populations can be thought of as:
Think about the heights of all people in the USA named Michael.
Group of Existing People: All 3.28M Michaels
DGP for Michaels: the theoretical process that creates Michaels
I claim that I’m faster at crosswords than you. We both do one crossword:
my time: 25m 05s
your time: 25m 23s
Am I right? What would it take to convince you I’m right?
What do I really mean when I say that I’m faster at crosswords than you?
Statistics: are functions of data that summarize the data, \(T(\mathbf{X})\).
Population Statistic: \(T(\mathbf{X})\); where \(\mathbf{X}\) is the random variable (e.g. mean height of people named Michael)
Sample Statistic: \(T(\mathbf{x})\); where \(\mathbf{x}\) is a realized sample of \(\mathbf{X}\) (e.g. mean of 100 randomly sampled heights of people named Michael)
Sample Mean: \(\frac{1}{n} \sum_{i=1}^n x_i\)
(or: 67th percentile, min, max, variance, z-statistic…)
When choosing a statistic, you’re implicitly agreeing that two samples are the same if \(T(\mathbf{x}) = T(\mathbf{y})\)
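A quick sketch of a few choices of statistic \(T(\mathbf{x})\), using a simulated sample (the heights, sample size, and distribution parameters here are assumptions for illustration, not real data on Michaels):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical sample: 100 simulated heights (inches); mean 69.5 and sd 3.0 are assumed
x = rng.normal(loc=69.5, scale=3.0, size=100)

# several possible statistics T(x); each summarizes the same sample differently
print(np.mean(x))            # sample mean
print(np.percentile(x, 67))  # 67th percentile
print(x.min(), x.max())      # min, max
print(np.var(x, ddof=0))     # sample variance (1/n version)
```

Two samples with the same mean can have very different variances, which is exactly what "agreeing two samples are the same if \(T(\mathbf{x}) = T(\mathbf{y})\)" glosses over.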
I claim that my mean crossword time is faster than yours:
\[ \mu_{me} > \mu_{you} \]
Note: here, let’s model our observed times as draws from normal DGPs with means \(\mu_{me}\) and \(\mu_{you}\):
\[ \text{Chelsea}_i \sim N \left (\mu_{me}, \sigma_{me} \right) \\ \text{You}_i \sim N \left (\mu_{you}, \sigma_{you} \right) \]
But I can’t possibly observe \(\mu_{me}\) and \(\mu_{you}\)…
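A simulation can show why one crossword isn’t convincing evidence. The parameter values below are assumptions for illustration (we can never observe them in real life); even with \(\mu_{me} < \mu_{you}\), I lose a single head-to-head race fairly often:

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical "true" parameters -- assumed, never observable in practice
mu_me, mu_you = 25.0, 26.0          # mean times in minutes
sigma_me = sigma_you = 3.0          # sd of times in minutes

# simulate many head-to-head crosswords from each DGP
chelsea = rng.normal(mu_me, sigma_me, size=10_000)
you = rng.normal(mu_you, sigma_you, size=10_000)

# fraction of single crosswords where the "slower" person wins anyway
print(np.mean(chelsea > you))
```

With these assumed parameters, the faster-on-average solver loses roughly 40% of individual crosswords, so an 18-second gap in one puzzle tells us very little about \(\mu_{me}\) vs \(\mu_{you}\).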
If we were doing description, the sample mean alone would accomplish our goal.
But if doing inference, we want to generalize. We don’t want to know about \(\bar{x}\), we want to know about \(\mu\).
Note: we often use Greek letters to denote population statistics and Roman letters for sample statistics.
New Goal: Formalize a process to get from Sample statistics to Population statistics
e.g. what can I learn about the mean height of college students, \(\mu\), from a random sample mean, \(\bar{x}\)?
via Peter Tennant at https://x.com/PWGTennant/status/1164084443742691328
Estimand: a target quantity to be estimated
Estimator: a function, \(W(\mathbf{x})\) that is a recipe about how to get an estimate from a sample
Estimate: a realized value of \(W(\mathbf{x})\) applied to an actual sample, \(\mathbf{x}\)
Sometimes, finding an estimator feels intuitive (e.g. using the sample mean to estimate the population mean), but remember: someone at some point had to figure out that the sample mean was a good estimator.
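The estimand/estimator/estimate distinction, as a minimal sketch (the population value and sample here are simulated assumptions, which is the one setting where we actually know the estimand):

```python
import numpy as np

rng = np.random.default_rng(5)

# estimand: the population mean mu (known here only because we simulate it)
mu = 50.0

# estimator: a recipe W(x) -- here, the sample mean
def W(sample):
    return np.mean(sample)

# estimate: the realized value of W applied to one actual sample x
x = rng.normal(mu, 10.0, size=500)
estimate = W(x)
print(estimate)
```

The estimand is a fixed (unknown) number, the estimator is a function, and the estimate is a number that changes from sample to sample.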
Method of Moments
Maximum Likelihood Estimators
Set the first \(k\) sample moments equal to the first \(k\) population moments, then solve for the parameters.
\[ \underbrace{m_1}_\text{1st sample moment} = \overbrace{\mu'_1}^\text{1st population moment} \\ m_2 = \mu'_2 \\ ... \\ m_k = \mu'_k \\ \]
Moments of a distribution are expectations.
\[ \mu'_n = \mathbb{E}X^n \]
Central Moments replace \(X\) with the mean centered value \(X-\mu\).
\[ \mu_n = \mathbb{E}(X-\mu)^n \]
Remember:
\(p^{th}\) sample moment: \(\frac{1}{n} \sum_{i=1}^n X_i^p\)
\(p^{th}\) population moment: \(\mathbb{E}(X^p)\)
Let’s say \(X \sim \mathcal{N}(\mu, \sigma^2)\), \(k = 2\)
first moment: \(\frac{1}{n} \sum_{i=1}^n x_i= \mathbb{E}(X) = \mu\)
second moment: \(\frac{1}{n} \sum_{i=1}^n x_i^2 = \mathbb{E}(X^2) = \mu^2 + \sigma^2\)

Substituting \(\hat{\mu} = \bar{x}\): \(\frac{1}{n} \sum_{i=1}^n x_i^2 = \bar{x}^2 + \hat{\sigma}^2\)
\(\hat{\sigma}^2 = \left [\frac{1}{n} \sum_{i=1}^n x_i^2 \right] - \bar{x}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2\)
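The two derived estimators can be checked numerically. A minimal sketch with simulated data (the values \(\mu = 5\), \(\sigma^2 = 4\) are assumed just for the check):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma2_true = 5.0, 4.0   # assumed population values for the check
x = rng.normal(mu_true, np.sqrt(sigma2_true), size=100_000)

# method-of-moments estimators from the derivation above
mu_hat = np.mean(x)                       # first sample moment
sigma2_hat = np.mean(x**2) - mu_hat**2    # second raw moment minus mean^2

print(mu_hat, sigma2_hat)
```

Both forms of \(\hat{\sigma}^2\) (raw-moment minus \(\bar{x}^2\), and mean squared deviation) agree exactly, and both estimates land close to the assumed population values.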
okay but that felt too easy…
A gamma distribution has two parameters: shape \(\alpha\) and scale \(\theta\). Let’s use MoM to find estimators for them.
\[ f(x; \alpha, \theta) = \frac{1}{\Gamma(\alpha)\theta^\alpha} x^{\alpha-1} e^{-x/\theta} \]
First moment: \[ \mathbb{E}(X_i) = \mu = \alpha\theta \] Second central moment: \[ Var(X_i) = \mathbb{E}\left[ (X_i - \mu)^2\right ] = \alpha\theta^2 \]
Next, we set these equal to the sample moments: \[ \mathbb{E}(X_i) = \mu = \alpha\theta = \underbrace{\frac{1}{n}\sum x_i}_\text{sample mean} = \bar{x} \]
\[ Var(X_i) = \mathbb{E}\left[ (X_i - \mu)^2\right ] = \alpha\theta^2 = \underbrace{\frac{1}{n}\sum (x_i-\bar{x})^2}_\text{sample var} \]
Now solve for each parameter!
Try it yourself!
Now solve for each parameter!
\(\alpha\):
\[ \alpha\theta = \underbrace{\frac{1}{n}\sum x_i}_\text{sample mean} \\ \alpha = \frac{1}{n\theta}\sum x_i = \frac{\bar{x}}{\theta} \]
Now solve for each parameter!
using \(\alpha = \frac{\bar{x}}{\theta}\) to sub into the variance equation…
\[ \alpha\theta^2 = \frac{\bar{x}}{\theta}\theta^2 = \bar{x}\theta = \underbrace{\frac{1}{n}\sum (x_i-\bar{x})^2}_\text{sample var} \rightarrow \\ \hat{\theta} = \underbrace{\frac{1}{n\bar{x}}\sum (x_i-\bar{x})^2}_\text{MoM estimator} \]
Now plug this back into our original equation to get \(\alpha\):
\[ \hat{\alpha} = \frac{\bar{x}}{\hat{\theta}} = \frac{\bar{x}}{\frac{1}{n\bar{x}}\sum (x_i-\bar{x})^2} \]
We’ve found the estimators!!!!
\[ \hat{\alpha} = \frac{\bar{x}}{\frac{1}{n\bar{x}}\sum (x_i-\bar{x})^2} \\ \hat{\theta} = \frac{1}{n\bar{x}}\sum (x_i-\bar{x})^2 \]
hoorah.
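The gamma MoM estimators can also be verified numerically. A minimal sketch, assuming shape \(\alpha = 2\) and scale \(\theta = 1.5\) for the simulation:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_true, theta_true = 2.0, 1.5   # assumed shape and scale for the check
x = rng.gamma(shape=alpha_true, scale=theta_true, size=200_000)

xbar = np.mean(x)                 # sample mean
s2 = np.mean((x - xbar) ** 2)     # sample variance (1/n version)

# MoM estimators from the derivation above
theta_hat = s2 / xbar             # theta_hat = sample var / sample mean
alpha_hat = xbar / theta_hat      # alpha_hat = sample mean / theta_hat

print(alpha_hat, theta_hat)
```

With a large sample, both estimates land near the assumed shape and scale, which is a useful sanity check on the algebra.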
\[ \text{arg}\,\max\limits_{\theta} \mathcal{L}(\theta|x) \]
The estimate of \(\theta\) is the one that maximizes the likelihood of the data, \(x\).
\[
p(x | \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x- \mu)^2}{2\sigma^2}}
\]
Where \(\theta = (\mu, \sigma)\). We want to choose the value of \(\theta\) that maximizes the likelihood of the data, \(x\).
For a single data point, the value of the likelihood function, \(\mathcal{L}\left( \theta | x \right)\), is:
\[ \mathcal{L} \left( \theta | x\right) = p(x| \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x- \mu)^2}{2\sigma^2}} \]
If data points in a sample are independent, the likelihood value for all data points is simply the product of their individual likelihood values, since \(p(A,B) = p(A)*p(B) \text{ iff } A \mathrel{\unicode{x2AEB}} B\).
\[
\mathcal{L}\left(\theta | \mathbf{x} \right) = p(\mathbf{x} |\theta) = \prod_{i=1}^n p(x_i | \theta) = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i- \mu)^2}{2\sigma^2}}
\]
The higher the likelihood of our data, the more evidence that a particular \(\theta\) is a good fit for the data.
\[ \text{arg max}_{\theta} \left[ \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i- \mu)^2}{2\sigma^2}} \right] \]
to maximize, we:
But…
…taking the derivative of a function of products is hard, so we use log likelihood.
\[ \ell\left(\theta | \mathbf{x} \right) = \log\left(\mathcal{L}\left(\theta | \mathbf{x} \right)\right) = \\ -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log (\sigma^2) -\frac{1}{2 \sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]
Note: \(\log()\) is a monotonically increasing function, so choosing \(\theta\) that maximizes \(\ell\left(\theta | \mathbf{x} \right)\) will also maximize \(\mathcal{L}\left(\theta | \mathbf{x} \right)\)
\[ \ell\left(\theta | \mathbf{x} \right) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log (\sigma^2) -\frac{1}{2 \sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]
Example with normal distribution:
\(\hat{\mu} : \frac{\partial}{\partial \mu} \ell(\theta | x) = 0\)
\(\hat{\sigma} : \frac{\partial}{\partial \sigma} \ell(\theta | x) = 0\)
Solution for Normal Distribution:
\(\hat{\mu} = \frac{1}{n} \sum_{i=1}^nx_i\)
\(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n(x_i - \hat{\mu})^2\)
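The closed-form MLEs can be checked numerically against the log-likelihood itself. A minimal sketch with simulated data (the values \(\mu = 10\), \(\sigma = 2\) are assumed for the check):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(10.0, 2.0, size=5_000)   # assumed mu = 10, sigma = 2

# normal log-likelihood from the slide, as a function of (mu, sigma^2)
def loglik(mu, sigma2):
    n = x.size
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2)
            - np.sum((x - mu) ** 2) / (2 * sigma2))

# closed-form MLEs from setting the partial derivatives to zero
mu_hat = np.mean(x)
sigma2_hat = np.mean((x - mu_hat) ** 2)

# sanity check: the MLE beats nearby parameter values
for dmu in (-0.1, 0.1):
    print(loglik(mu_hat, sigma2_hat) > loglik(mu_hat + dmu, sigma2_hat))
```

Perturbing \(\hat{\mu}\) or \(\hat{\sigma}^2\) in either direction lowers the log-likelihood, confirming the closed-form solution is a maximum.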